Context¶

A significant number of hotel bookings are called off due to cancellations or no-shows. The typical reasons for cancellations include change of plans, scheduling conflicts, etc. This is often made easier by the option to do so free of charge or preferably at a low cost which is beneficial to hotel guests but it is a less desirable and possibly revenue-diminishing factor for hotels to deal with. Such losses are particularly high on last-minute cancellations.

The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.

The cancellation of bookings impact a hotel on various fronts:

  1. Loss of resources (revenue) when the hotel cannot resell the room.
  2. Additional costs of distribution channels by increasing commissions or paying for publicity to help sell these rooms.
  3. Lowering prices last minute, so the hotel can resell a room, resulting in reducing the profit margin.
  4. Human resources to make arrangements for the guests.

Objective¶

The increasing number of cancellations calls for a Machine Learning based solution that can help in predicting which booking is likely to be canceled. INN Hotels Group has a chain of hotels in Portugal, they are facing problems with the high number of booking cancellations and have reached out to your firm for data-driven solutions. You as a data scientist have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict which booking is going to be canceled in advance, and help in formulating profitable policies for cancellations and refunds.

Data Description¶

The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.

Data Dictionary

  • Booking_ID: unique identifier of each booking
  • no_of_adults: Number of adults
  • no_of_children: Number of Children
  • no_of_weekend_nights: Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
  • no_of_week_nights: Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
  • type_of_meal_plan: Type of meal plan booked by the customer:
    • Not Selected – No meal plan selected
    • Meal Plan 1 – Breakfast
    • Meal Plan 2 – Half board (breakfast and one other meal)
    • Meal Plan 3 – Full board (breakfast, lunch, and dinner)
  • required_car_parking_space: Does the customer require a car parking space? (0 - No, 1- Yes)
  • room_type_reserved: Type of room reserved by the customer. The values are ciphered (encoded) by INN Hotels.
  • lead_time: Number of days between the date of booking and the arrival date
  • arrival_year: Year of arrival date
  • arrival_month: Month of arrival date
  • arrival_date: Date of the month
  • market_segment_type: Market segment designation.
  • repeated_guest: Is the customer a repeated guest? (0 - No, 1- Yes)
  • no_of_previous_cancellations: Number of previous bookings that were canceled by the customer prior to the current booking
  • no_of_previous_bookings_not_canceled: Number of previous bookings not canceled by the customer prior to the current booking
  • avg_price_per_room: Average price per day of the reservation; prices of the rooms are dynamic. (in euros)
  • no_of_special_requests: Total number of special requests made by the customer (e.g. high floor, view from the room, etc)
  • booking_status: Flag indicating if the booking was canceled or not.
In [1]:
# importing necessary libraries
%load_ext nb_black

import warnings

warnings.filterwarnings("ignore")
from statsmodels.tools.sm_exceptions import ConvergenceWarning

warnings.simplefilter("ignore", ConvergenceWarning)

# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np

# libaries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# setting the precision of floating numbers to 5 decimal points
pd.set_option("display.float_format", lambda x: "%.5f" % x)

# Library to split data
from sklearn.model_selection import train_test_split

# To build model for prediction
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# To tune different models
from sklearn.model_selection import GridSearchCV


# To get diferent metric scores
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    plot_confusion_matrix,
    precision_recall_curve,
    roc_curve,
    make_scorer,
)
In [2]:
# import dataset
hotel = pd.read_csv("C:/Users/USER/Downloads/INNHotelsGroup.csv")
In [3]:
data = (
    hotel.copy()
)  # copy data into new variable to avoid any changes to the original copy
In [4]:
data.head()  # view first five rows of the dataset
Out[4]:
Booking_ID no_of_adults no_of_children no_of_weekend_nights no_of_week_nights type_of_meal_plan required_car_parking_space room_type_reserved lead_time arrival_year arrival_month arrival_date market_segment_type repeated_guest no_of_previous_cancellations no_of_previous_bookings_not_canceled avg_price_per_room no_of_special_requests booking_status
0 INN00001 2 0 1 2 Meal Plan 1 0 Room_Type 1 224 2017 10 2 Offline 0 0 0 65.00000 0 Not_Canceled
1 INN00002 2 0 2 3 Not Selected 0 Room_Type 1 5 2018 11 6 Online 0 0 0 106.68000 1 Not_Canceled
2 INN00003 1 0 2 1 Meal Plan 1 0 Room_Type 1 1 2018 2 28 Online 0 0 0 60.00000 0 Canceled
3 INN00004 2 0 0 2 Meal Plan 1 0 Room_Type 1 211 2018 5 20 Online 0 0 0 100.00000 0 Canceled
4 INN00005 2 0 1 1 Not Selected 0 Room_Type 1 48 2018 4 11 Online 0 0 0 94.50000 0 Canceled
In [5]:
data.tail()  # view last five rows of the dataset
Out[5]:
Booking_ID no_of_adults no_of_children no_of_weekend_nights no_of_week_nights type_of_meal_plan required_car_parking_space room_type_reserved lead_time arrival_year arrival_month arrival_date market_segment_type repeated_guest no_of_previous_cancellations no_of_previous_bookings_not_canceled avg_price_per_room no_of_special_requests booking_status
36270 INN36271 3 0 2 6 Meal Plan 1 0 Room_Type 4 85 2018 8 3 Online 0 0 0 167.80000 1 Not_Canceled
36271 INN36272 2 0 1 3 Meal Plan 1 0 Room_Type 1 228 2018 10 17 Online 0 0 0 90.95000 2 Canceled
36272 INN36273 2 0 2 6 Meal Plan 1 0 Room_Type 1 148 2018 7 1 Online 0 0 0 98.39000 2 Not_Canceled
36273 INN36274 2 0 0 3 Not Selected 0 Room_Type 1 63 2018 4 21 Online 0 0 0 94.50000 0 Canceled
36274 INN36275 2 0 1 2 Meal Plan 1 0 Room_Type 1 207 2018 12 30 Offline 0 0 0 161.67000 0 Not_Canceled
In [6]:
data.shape  # the number of rows and columns in the dataset
Out[6]:
(36275, 19)
  • There are 36275 rows and 19 columns in the dataset.
In [7]:
data.info()  # concise summary of the columns of the dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Booking_ID                            36275 non-null  object 
 1   no_of_adults                          36275 non-null  int64  
 2   no_of_children                        36275 non-null  int64  
 3   no_of_weekend_nights                  36275 non-null  int64  
 4   no_of_week_nights                     36275 non-null  int64  
 5   type_of_meal_plan                     36275 non-null  object 
 6   required_car_parking_space            36275 non-null  int64  
 7   room_type_reserved                    36275 non-null  object 
 8   lead_time                             36275 non-null  int64  
 9   arrival_year                          36275 non-null  int64  
 10  arrival_month                         36275 non-null  int64  
 11  arrival_date                          36275 non-null  int64  
 12  market_segment_type                   36275 non-null  object 
 13  repeated_guest                        36275 non-null  int64  
 14  no_of_previous_cancellations          36275 non-null  int64  
 15  no_of_previous_bookings_not_canceled  36275 non-null  int64  
 16  avg_price_per_room                    36275 non-null  float64
 17  no_of_special_requests                36275 non-null  int64  
 18  booking_status                        36275 non-null  object 
dtypes: float64(1), int64(13), object(5)
memory usage: 5.3+ MB
  • The dataset has 19 columns.
  • The dataset has 5 columns of object datatype, 13 columns of integer datatype and 1 column of float datatype.
In [8]:
data.duplicated().sum()  # Check for duplicate values
Out[8]:
0
  • There are NO duplicate values in the dataset.
In [9]:
data.isnull().sum()  # check for missing values
Out[9]:
Booking_ID                              0
no_of_adults                            0
no_of_children                          0
no_of_weekend_nights                    0
no_of_week_nights                       0
type_of_meal_plan                       0
required_car_parking_space              0
room_type_reserved                      0
lead_time                               0
arrival_year                            0
arrival_month                           0
arrival_date                            0
market_segment_type                     0
repeated_guest                          0
no_of_previous_cancellations            0
no_of_previous_bookings_not_canceled    0
avg_price_per_room                      0
no_of_special_requests                  0
booking_status                          0
dtype: int64
  • There are NO missing values in the dataset.
In [10]:
data.nunique()  # number of unique values in the columns of the dataset.
Out[10]:
Booking_ID                              36275
no_of_adults                                5
no_of_children                              6
no_of_weekend_nights                        8
no_of_week_nights                          18
type_of_meal_plan                           4
required_car_parking_space                  2
room_type_reserved                          7
lead_time                                 352
arrival_year                                2
arrival_month                              12
arrival_date                               31
market_segment_type                         5
repeated_guest                              2
no_of_previous_cancellations                9
no_of_previous_bookings_not_canceled       59
avg_price_per_room                       3930
no_of_special_requests                      6
booking_status                              2
dtype: int64
  • Booking ID can be dropped as it does not have relevant information for our analysis.
In [11]:
data = data.drop(["Booking_ID"], axis=1)
In [12]:
data.describe().T  # statistical summary of the dataset
Out[12]:
count mean std min 25% 50% 75% max
no_of_adults 36275.00000 1.84496 0.51871 0.00000 2.00000 2.00000 2.00000 4.00000
no_of_children 36275.00000 0.10528 0.40265 0.00000 0.00000 0.00000 0.00000 10.00000
no_of_weekend_nights 36275.00000 0.81072 0.87064 0.00000 0.00000 1.00000 2.00000 7.00000
no_of_week_nights 36275.00000 2.20430 1.41090 0.00000 1.00000 2.00000 3.00000 17.00000
required_car_parking_space 36275.00000 0.03099 0.17328 0.00000 0.00000 0.00000 0.00000 1.00000
lead_time 36275.00000 85.23256 85.93082 0.00000 17.00000 57.00000 126.00000 443.00000
arrival_year 36275.00000 2017.82043 0.38384 2017.00000 2018.00000 2018.00000 2018.00000 2018.00000
arrival_month 36275.00000 7.42365 3.06989 1.00000 5.00000 8.00000 10.00000 12.00000
arrival_date 36275.00000 15.59700 8.74045 1.00000 8.00000 16.00000 23.00000 31.00000
repeated_guest 36275.00000 0.02564 0.15805 0.00000 0.00000 0.00000 0.00000 1.00000
no_of_previous_cancellations 36275.00000 0.02335 0.36833 0.00000 0.00000 0.00000 0.00000 13.00000
no_of_previous_bookings_not_canceled 36275.00000 0.15341 1.75417 0.00000 0.00000 0.00000 0.00000 58.00000
avg_price_per_room 36275.00000 103.42354 35.08942 0.00000 80.30000 99.45000 120.00000 540.00000
no_of_special_requests 36275.00000 0.61966 0.78624 0.00000 0.00000 0.00000 1.00000 5.00000
  • The respective average and maximum number of adults per booking is two (2) and four (4) respectively.
  • The data does not record significant statistical summary for children except for a maximum value of 10 for a booking.
  • The most number of weekend nights recorded is 7 while the most number of weekday nights is 17.
  • The mean number of weekday nights is two (2).
  • The highest lead time recorded by a client is 443 days while the average of the same is 85 days. 50% of clients showed 57 days of lead time.
  • The most number of special requests received is 5.
  • The average price per room is 103 euros while the most expensive is 540 euros.
In [13]:
# Function needed to perform exploratory data analysis using histogram and boxplot


def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (15,10))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
In [14]:
# Function needed to perform exploratory data analysis using histogram and boxplot


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n],
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot
In [15]:
### function to plot distributions wrt target


def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 2, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )

    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )

    plt.tight_layout()
    plt.show()
In [16]:
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    plt.legend(
        loc="lower left", frameon=False,
    )
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()

Univariate Analysis¶

In [17]:
# Observation on number of adults
labeled_barplot(data, "no_of_adults", perc=True)
  • A lot of bookings was made for two adults while the least of the same was for group of four adults.
In [18]:
# Observation on number of children
labeled_barplot(data, "no_of_children", perc=True)
  • 92.6% of bookings made had no children attaending. The few that did had only up to two kids in attendance.
In [19]:
# replacing 9, and 10 children with 3
data["no_of_children"] = data["no_of_children"].replace([9, 10], 3)
In [20]:
# Observation on number of week nights
labeled_barplot(data, "no_of_week_nights", perc=True)
  • 31.5% of the bookings were for two week nights which is essentially the highest for this column.
  • 1, 3, 4, 0 and 5 nights followed respectively.
  • 6 - 16 nights bookings were not signifcantly made.
In [21]:
# Observation on number of weekend nights
labeled_barplot(data, "no_of_weekend_nights", perc=True)
  • 46.5% of the bookings did not include weekend nights or were not made for that period.
  • Except for 1 and 2 weekend nights, there were no significant bookings for 3-7 weekend nights.
In [22]:
# Observation on car parking space required
labeled_barplot(data, "required_car_parking_space", perc=True)
  • 96.9% of clients did not request parking space in their bookings
In [23]:
# Observation on type of meal plan
labeled_barplot(data, "type_of_meal_plan", perc=True)
  • 76.7% of bookings made included meal plan 1.
  • 14.1% of bookings did not select a meal plan nd 9.1% of bookings chose meal 2.
  • Meal plan 3 is not a choice meal plan as it records the least number of observations.
In [24]:
# Observation on room type reserved
labeled_barplot(data, "room_type_reserved", perc=True)
  • Room Type 1 is largely the choice room with 77.5% of bookings equivalent to more than 25000.
  • 16.7% of bookings included room type 4. The remaining room types had less than 5000 bookings.
In [25]:
# Observation on arrival month
labeled_barplot(data, "arrival_month", perc=True)
  • October was the busiest month with more than 5000 arrivals.
  • Except for September, the remaining months received less than 4000 arrivals.
In [26]:
# Observation on market segment type
labeled_barplot(data, "market_segment_type", perc=True)
  • The online market group had 64% (over 20000) of bookings. Except for the offline group, the remaining market segments had less than 5000 bookings.
In [27]:
# Observation on number of special requests
labeled_barplot(data, "no_of_special_requests", perc=True)
  • 54.5% of bookings had no special request associated with it.
  • 31.4% of bookings made had one special request while less than 15% had between 2-5 special requests.
In [28]:
# Observation on booking status
labeled_barplot(data, "booking_status", perc=True)
  • More than 10000 bookings were canceled compared to the almost 25000 which were not.
In [29]:
# encode canceled bookings to 1 and not canceled to 0
data["booking_status"] = data["booking_status"].apply(
    lambda x: 1 if x == "Canceled" else 0
)
In [30]:
histogram_boxplot(data, "lead_time")  # observation on lead time
  • The mean lead time is greater than the median lead time hence the right skewed distribution.
  • 50% of bookings had lead time less than the average of 100 days.
  • There are many outliers after the right whisker in this column indicating that a significant number of bookings were made in advance of more than 300 days.
In [31]:
histogram_boxplot(data, "avg_price_per_room")  # observation on average price per room
  • The mean and median values are almost equal hence the normal distribution
  • Many bookings made were close to the average cost indicating a preference for rooms in that bracket.
  • There are more outliers on the right whisker than on the left whiskers indicating more expensive rooms than cheaper ones.
In [32]:
data[
    data["avg_price_per_room"] == 0
]  # filter rows with average price per room equal to zero
Out[32]:
no_of_adults no_of_children no_of_weekend_nights no_of_week_nights type_of_meal_plan required_car_parking_space room_type_reserved lead_time arrival_year arrival_month arrival_date market_segment_type repeated_guest no_of_previous_cancellations no_of_previous_bookings_not_canceled avg_price_per_room no_of_special_requests booking_status
63 1 0 0 1 Meal Plan 1 0 Room_Type 1 2 2017 9 10 Complementary 0 0 0 0.00000 1 0
145 1 0 0 2 Meal Plan 1 0 Room_Type 1 13 2018 6 1 Complementary 1 3 5 0.00000 1 0
209 1 0 0 0 Meal Plan 1 0 Room_Type 1 4 2018 2 27 Complementary 0 0 0 0.00000 1 0
266 1 0 0 2 Meal Plan 1 0 Room_Type 1 1 2017 8 12 Complementary 1 0 1 0.00000 1 0
267 1 0 2 1 Meal Plan 1 0 Room_Type 1 4 2017 8 23 Complementary 0 0 0 0.00000 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
35983 1 0 0 1 Meal Plan 1 0 Room_Type 7 0 2018 6 7 Complementary 1 4 17 0.00000 1 0
36080 1 0 1 1 Meal Plan 1 0 Room_Type 7 0 2018 3 21 Complementary 1 3 15 0.00000 1 0
36114 1 0 0 1 Meal Plan 1 0 Room_Type 1 1 2018 3 2 Online 0 0 0 0.00000 0 0
36217 2 0 2 1 Meal Plan 1 0 Room_Type 2 3 2017 8 9 Online 0 0 0 0.00000 2 0
36250 1 0 0 2 Meal Plan 2 0 Room_Type 1 6 2017 12 10 Online 0 0 0 0.00000 0 0

545 rows × 18 columns

In [33]:
data.loc[
    data["avg_price_per_room"] == 0, "market_segment_type"
].value_counts()  # show the count of rows with average price per room equal to zero and market segment type
Out[33]:
Complementary    354
Online           191
Name: market_segment_type, dtype: int64
  • The complementary market group has a bigger bracket than the online market group.
In [34]:
# Calculating the 25th quantile
Q1 = data["avg_price_per_room"].quantile(
    0.25
)  # calculate 25th quantile for average price per room

# Calculating the 75th quantile
Q3 = data["avg_price_per_room"].quantile(
    0.75
)  # calculate 75th quantile for average price per room

# Calculating IQR
IQR = Q3 - Q1

# Calculating value of upper whisker
Upper_Whisker = Q3 + 1.5 * IQR

print("The value of the upper whisker is", Upper_Whisker)
The value of the upper whisker is 179.55
  • Most rooms cost at least 80.3 euros and a quarter of those rooms cost 179.55 or more.
In [35]:
# assigning the outliers the value of upper whisker
data.loc[data["avg_price_per_room"] >= 500, "avg_price_per_room"] = Upper_Whisker

Hotel rates are dynamic and change according to demand and customer demographics. Let's see how prices vary across different market segments

In [36]:
sns.boxplot(data=data, y="avg_price_per_room", x="market_segment_type")
# boxplot on average price and market segment
Out[36]:
<AxesSubplot:xlabel='market_segment_type', ylabel='avg_price_per_room'>
  • The online market segment has the highest average price per room bookings and the corporate group having the least average.
  • There are outliers in all the market groups except the Aviation market group.
In [37]:
histogram_boxplot(
    data, "no_of_previous_cancellations"
)  # observation on number of previous booking cancellations
  • Most bookings were not canceled prior to the current cancellation.
  • There are outliers that indicate some high number of previous cancellations
In [38]:
histogram_boxplot(
    data, "no_of_previous_bookings_not_canceled"
)  # observation on number of previous bookings not canceled
  • Most orders were not canceled.

Bivariate Analysis¶

In [39]:
cols_list = data.select_dtypes(
    include=np.number
).columns.tolist()  # create array of numerical type columns

plt.figure(figsize=(12, 7))  # size configurations of the plot

# Create heatmap of numerical columns
sns.heatmap(
    data[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
# show plot
  • Generally heatmap does not show very high correlation values except 'repeated guest and number of previous bookings not canceled' which show a correlation value of 0.54, the highest in the map.

Let's see how booking status varies across different market segments. Also, how average price per room impacts booking status

In [40]:
stacked_barplot(
    data, "market_segment_type", "booking_status"
)  # bar graph of market segment type and booking status
booking_status           0      1    All
market_segment_type                     
All                  24390  11885  36275
Online               14739   8475  23214
Offline               7375   3153  10528
Corporate             1797    220   2017
Aviation                88     37    125
Complementary          391      0    391
------------------------------------------------------------------------------------------------------------------------
In [41]:
plt.figure(figsize=(10, 6))  # size configurations
sns.boxplot(
    data=data, x="booking_status", y="avg_price_per_room"
)  # create a boxplot of the two variables
Out[41]:
<AxesSubplot:xlabel='booking_status', ylabel='avg_price_per_room'>
  • The online demographic had the most canceled bookings while the complementary had the least.
  • The canceled bookings have greater mean price per room compared to the uncancelled ones.

Many guests have special requirements when booking a hotel room. Let's see how it impacts cancellations

In [42]:
plt.figure(figsize=(10, 6))  # size configuration
sns.barplot(
    data=data, x="booking_status", y="no_of_special_requests"
)  # create a bar graph of the given variables
Out[42]:
<AxesSubplot:xlabel='booking_status', ylabel='no_of_special_requests'>
In [43]:
plt.figure(figsize=(10, 6))  # size configuration
sns.barplot(
    data=data, x="no_of_special_requests", y="no_of_previous_cancellations",
)  # create a bar graph of given variables
Out[43]:
<AxesSubplot:xlabel='no_of_special_requests', ylabel='no_of_previous_cancellations'>
In [44]:
plt.figure(figsize=(10, 6))  # size configurations
sns.barplot(
    data=data, x="no_of_special_requests", y="no_of_previous_bookings_not_canceled"
)  # create a bar graph of the given variables
Out[44]:
<AxesSubplot:xlabel='no_of_special_requests', ylabel='no_of_previous_bookings_not_canceled'>
  • The number of previous bookings not canceled increased with increasing number of special requests made.
  • However, there was a random pattern for bookings that were previously canceled possibly indicating that cancelations were made for different reasons.
  • Bookings that were not canceled had more special requests than those that were canceled.

Let's see if the special requests made by the customers impacts the prices of a room

In [45]:
plt.figure(figsize=(10, 6))  # size configurations
sns.boxplot(
    data=data, x="no_of_special_requests", y="avg_price_per_room"
)  # create boxplot for no of special requests and average price per room (excluding the outliers)
plt.show()  # plot graph
  • Bookings with four special requests have the highest mean price per room. This is understandable as more requests could impact cost.
  • The average price per room seems to increase with increasing number of special requests.
  • Five special requests has the least average price per room possibly because there were not many bookings for it.

We saw earlier that there is a positive correlation between booking status and average price per room. Let's analyze it

In [46]:
distribution_plot_wrt_target(
    data, "avg_price_per_room", "booking_status"
)  # histogram and boxplot of average price per room and booking status
  • The canceled bookings have greater mean price compared to the uncancelled bookings. Clients probably changed their mind due to cost.

There is a positive correlation between booking status and lead time also. Let's analyze it further

In [47]:
distribution_plot_wrt_target(
    data, "lead_time", "booking_status"
)  # histogram and boxplot of lead time and booking status
  • The canceled bookings had greater mean lead time than the uncanceled bookings possibly because the clients did not show up at all.

Generally people travel with their spouse and children for vacations or other activities. Let's create a new dataframe of the customers who traveled with their families and analyze the impact on booking status.

In [48]:
family_data = data[
    (data["no_of_children"] >= 0) & (data["no_of_adults"] > 1)
]  # create dataframe with number of children equal to 1 and number of adults greater than 1
family_data.shape  # show number of rows and columns of the filtered dataset
Out[48]:
(28441, 18)
  • There are 28441 rows and 18 columns in the dataset
In [49]:
family_data["no_of_family_members"] = (
    family_data["no_of_adults"] + family_data["no_of_children"]
)  # create column with total number of adults and children
In [50]:
plt.figure(figsize=(10, 6))  # size configurations
sns.barplot(data=data, x="booking_status", y=family_data["no_of_family_members"])
# barplot of booking status and total family size
In [51]:
plt.figure(figsize=(10, 6))  # size configurations
sns.boxplot(data=data, x="booking_status", y=family_data["no_of_family_members"])
# boxplot of booking status and total family size
  • Both booking status categories have equal number of family size.
  • The size of the family does not significantly impact the status of the booking. This means, the canceled orders were not influenced by the size of the family that made the booking.

Similar analysis for the customer who stay for at least a day at the hotel.

In [52]:
stay_data = data[
    (data["no_of_week_nights"] > 0) & (data["no_of_weekend_nights"] > 0)
]  # create dataframe with number of week nights greater than zero and number of weekend nights greater than 0.
stay_data.shape  # number of rows and columns of dataset
Out[52]:
(17094, 18)
  • There are 17094 rows and 18 columns in the dataset
In [53]:
stay_data["total_days"] = (
    stay_data["no_of_week_nights"] + stay_data["no_of_weekend_nights"]
)  # create dataset with total number of days spent
In [54]:
plt.figure(figsize=(10, 6))  # size configurations
sns.barplot(data=data, x="booking_status", y=stay_data["total_days"])
# bar graph of booking status and total days spent
In [55]:
stay_data["no_of_week_nights"].sum()  # total number of bookings for weekdays
Out[55]:
43443
  • The total number of bookings for week nights is 43443 days.
In [56]:
stay_data["no_of_weekend_nights"].sum()  # total number of bookings for weekends
Out[56]:
26313
  • The total number of bookings for weekend nights is 26313 days.
  • Clients who canceled their bookings stayed longer than those that did not. However, this observation could be impacted by the count of bookings for both periods of the week.

Repeating guests are the guests who stay in the hotel often and are important to brand equity. Let's see what percentage of repeating guests cancel?

In [57]:
labeled_barplot(
    data, "repeated_guest", perc=True
)  # bar graph of repeated guest feature
  • A very small number of repeating guests cancel their bookings. This is understandable because clients clearly enjoy their stay hence their continuous return.

Let's find out what are the busiest months in the hotel.

In [58]:
# grouping the data on arrival months and extracting the count of bookings
monthly_data = data.groupby(["arrival_month"])["booking_status"].count()

# creating a dataframe with months and count of customers in each month
monthly_data = pd.DataFrame(
    {"Month": list(monthly_data.index), "Guests": list(monthly_data.values)}
)

# plotting the trend over different months
plt.figure(figsize=(10, 5))  # size parameters
sns.lineplot(data=monthly_data, x="Month", y="Guests")
plt.show()  # show graph
  • October and June are the two busiest months of the year. Further investigation could be done to verify the reasons for this pattern.

Let's check the percentage of bookings canceled in each month.

In [59]:
canceled_bookings_per_month = data[
    (data["booking_status"] == 1)
]  # filter canceled bookings per month
canceled_bookings_per_month["arrival_month"].value_counts(
    normalize=True
)  # show relative count of results
Out[59]:
10   0.15818
9    0.12941
8    0.12520
7    0.11056
6    0.10862
4    0.08372
5    0.07976
11   0.07362
3    0.05890
2    0.03618
12   0.03382
1    0.00202
Name: arrival_month, dtype: float64
In [60]:
labeled_barplot(
    canceled_bookings_per_month, "arrival_month", perc=True
)  # bar graph of arrival month
  • October, September, August, July and June respectively are the top five months with the highest number of canceled bookings.
  • Their booking percentages are above 10%.

As hotel room prices are dynamic, Let's see how the prices vary across different months

In [61]:
plt.figure(figsize=(12, 8))  # size parameters
sns.lineplot(
    data=data, x="arrival_month", y="avg_price_per_room"
)  # create lineplot between average price per room and arrival month
plt.show()  # show graph
In [62]:
plt.figure(figsize=(11, 8))  # size configurations
sns.boxplot(data=data, x="arrival_month", y="avg_price_per_room")
# boxplot of arrival month and average price per room
  • The average price per room significantly increases from January to May; it dips slightly for two months then slightly increases again till September; a significant dip follows from September to December.
  • June, September, May, August and July respectively had the highest mean price per room.

Outlier Check¶

In [63]:
# outlier detection using boxplot
numeric_columns = data.select_dtypes(include=np.number).columns.tolist()
# dropping booking_status
numeric_columns.remove("booking_status")

plt.figure(figsize=(15, 12))  # size parameters

# generate outlier graphs for the various features
for i, variable in enumerate(numeric_columns):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(data[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()
  • Number of week nights, Lead time, Number of previous cancellations, Number of previous books not canceled and average price per room are the columns with high presence of outliers.
  • However, as these are proper values we will not treat them to give a stronger understanding of the data.

Model Building¶

In [64]:
# defining a function to compute different metrics to check performance of a classification model built using statsmodels
def model_performance_classification_statsmodels(
    model, predictors, target, threshold=0.5
):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """

    # checking which probabilities are greater than threshold
    pred_temp = model.predict(predictors) > threshold
    # rounding off the above values to get classes
    pred = np.round(pred_temp)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
In [65]:
# defining a function to plot the confusion_matrix of a classification model


def confusion_matrix_statsmodels(model, predictors, target, threshold=0.5):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    y_pred = model.predict(predictors) > threshold
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

Data Preparation for Modeling¶

In [66]:
X = data.drop(["booking_status"], axis=1)  # drop dependent variable
Y = data["booking_status"]  # assign dependent variable to variable Y

# adding constant
X = sm.add_constant(X)  # add constant to X

X = pd.get_dummies(X, drop_first=True)  # create dummies for X

# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1
)  # split the data into train test in the ratio 70:30 with random_state = 1
X.head()  # first five rows of the dataset
Out[66]:
const no_of_adults no_of_children no_of_weekend_nights no_of_week_nights required_car_parking_space lead_time arrival_year arrival_month arrival_date repeated_guest no_of_previous_cancellations no_of_previous_bookings_not_canceled avg_price_per_room no_of_special_requests type_of_meal_plan_Meal Plan 2 type_of_meal_plan_Meal Plan 3 type_of_meal_plan_Not Selected room_type_reserved_Room_Type 2 room_type_reserved_Room_Type 3 room_type_reserved_Room_Type 4 room_type_reserved_Room_Type 5 room_type_reserved_Room_Type 6 room_type_reserved_Room_Type 7 market_segment_type_Complementary market_segment_type_Corporate market_segment_type_Offline market_segment_type_Online
0 1.00000 2 0 1 2 0 224 2017 10 2 0 0 0 65.00000 0 0 0 0 0 0 0 0 0 0 0 0 1 0
1 1.00000 2 0 2 3 0 5 2018 11 6 0 0 0 106.68000 1 0 0 1 0 0 0 0 0 0 0 0 0 1
2 1.00000 1 0 2 1 0 1 2018 2 28 0 0 0 60.00000 0 0 0 0 0 0 0 0 0 0 0 0 0 1
3 1.00000 2 0 0 2 0 211 2018 5 20 0 0 0 100.00000 0 0 0 0 0 0 0 0 0 0 0 0 0 1
4 1.00000 2 0 1 1 0 48 2018 4 11 0 0 0 94.50000 0 0 0 1 0 0 0 0 0 0 0 0 0 1
In [67]:
# print statements of the number of rows and columns in train and test data set and their relative percentages.
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (25392, 28)
Shape of test set :  (10883, 28)
Percentage of classes in training set:
0   0.67064
1   0.32936
Name: booking_status, dtype: float64
Percentage of classes in test set:
0   0.67638
1   0.32362
Name: booking_status, dtype: float64
  • There are 25392 rows and 28 columns in the dataset of the training set and 10883 rows and 28 columns in the testing set.
  • There is about 67.1% of observations belongs to class 0 (Not Failed) and 32.9% observations belongs to class 1 (Failed), and this is preserved in the train and test sets as seen in 67.6% of observations belonging to class 0 and 32.4% belonging to class 1.
In [68]:
X_train.info()  # concise summary of dataframe
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25392 entries, 13662 to 33003
Data columns (total 28 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   const                                 25392 non-null  float64
 1   no_of_adults                          25392 non-null  int64  
 2   no_of_children                        25392 non-null  int64  
 3   no_of_weekend_nights                  25392 non-null  int64  
 4   no_of_week_nights                     25392 non-null  int64  
 5   required_car_parking_space            25392 non-null  int64  
 6   lead_time                             25392 non-null  int64  
 7   arrival_year                          25392 non-null  int64  
 8   arrival_month                         25392 non-null  int64  
 9   arrival_date                          25392 non-null  int64  
 10  repeated_guest                        25392 non-null  int64  
 11  no_of_previous_cancellations          25392 non-null  int64  
 12  no_of_previous_bookings_not_canceled  25392 non-null  int64  
 13  avg_price_per_room                    25392 non-null  float64
 14  no_of_special_requests                25392 non-null  int64  
 15  type_of_meal_plan_Meal Plan 2         25392 non-null  uint8  
 16  type_of_meal_plan_Meal Plan 3         25392 non-null  uint8  
 17  type_of_meal_plan_Not Selected        25392 non-null  uint8  
 18  room_type_reserved_Room_Type 2        25392 non-null  uint8  
 19  room_type_reserved_Room_Type 3        25392 non-null  uint8  
 20  room_type_reserved_Room_Type 4        25392 non-null  uint8  
 21  room_type_reserved_Room_Type 5        25392 non-null  uint8  
 22  room_type_reserved_Room_Type 6        25392 non-null  uint8  
 23  room_type_reserved_Room_Type 7        25392 non-null  uint8  
 24  market_segment_type_Complementary     25392 non-null  uint8  
 25  market_segment_type_Corporate         25392 non-null  uint8  
 26  market_segment_type_Offline           25392 non-null  uint8  
 27  market_segment_type_Online            25392 non-null  uint8  
dtypes: float64(2), int64(13), uint8(13)
memory usage: 3.4 MB
  • The x train dataset is made up of numeric column data types - two float columns and integer datatype for the remaining datatype.

Building Logistic Regression Model¶

In [69]:
# fitting logistic regression model
logit = sm.Logit(y_train, X_train.astype(float))
lg = logit.fit()  # fit logistic regression

print(lg.summary())  # print summary of the model
Warning: Maximum number of iterations has been exceeded.
         Current function value: 0.425090
         Iterations: 35
                           Logit Regression Results                           
==============================================================================
Dep. Variable:         booking_status   No. Observations:                25392
Model:                          Logit   Df Residuals:                    25364
Method:                           MLE   Df Model:                           27
Date:                Fri, 27 Jan 2023   Pseudo R-squ.:                  0.3292
Time:                        15:39:33   Log-Likelihood:                -10794.
converged:                      False   LL-Null:                       -16091.
Covariance Type:            nonrobust   LLR p-value:                     0.000
========================================================================================================
                                           coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------
const                                 -922.8266    120.832     -7.637      0.000   -1159.653    -686.000
no_of_adults                             0.1137      0.038      3.019      0.003       0.040       0.188
no_of_children                           0.1580      0.062      2.544      0.011       0.036       0.280
no_of_weekend_nights                     0.1067      0.020      5.395      0.000       0.068       0.145
no_of_week_nights                        0.0397      0.012      3.235      0.001       0.016       0.064
required_car_parking_space              -1.5943      0.138    -11.565      0.000      -1.865      -1.324
lead_time                                0.0157      0.000     58.863      0.000       0.015       0.016
arrival_year                             0.4561      0.060      7.617      0.000       0.339       0.573
arrival_month                           -0.0417      0.006     -6.441      0.000      -0.054      -0.029
arrival_date                             0.0005      0.002      0.259      0.796      -0.003       0.004
repeated_guest                          -2.3472      0.617     -3.806      0.000      -3.556      -1.139
no_of_previous_cancellations             0.2664      0.086      3.108      0.002       0.098       0.434
no_of_previous_bookings_not_canceled    -0.1727      0.153     -1.131      0.258      -0.472       0.127
avg_price_per_room                       0.0188      0.001     25.396      0.000       0.017       0.020
no_of_special_requests                  -1.4689      0.030    -48.782      0.000      -1.528      -1.410
type_of_meal_plan_Meal Plan 2            0.1756      0.067      2.636      0.008       0.045       0.306
type_of_meal_plan_Meal Plan 3           17.3584   3987.836      0.004      0.997   -7798.656    7833.373
type_of_meal_plan_Not Selected           0.2784      0.053      5.247      0.000       0.174       0.382
room_type_reserved_Room_Type 2          -0.3605      0.131     -2.748      0.006      -0.618      -0.103
room_type_reserved_Room_Type 3          -0.0012      1.310     -0.001      0.999      -2.568       2.566
room_type_reserved_Room_Type 4          -0.2823      0.053     -5.304      0.000      -0.387      -0.178
room_type_reserved_Room_Type 5          -0.7189      0.209     -3.438      0.001      -1.129      -0.309
room_type_reserved_Room_Type 6          -0.9501      0.151     -6.274      0.000      -1.247      -0.653
room_type_reserved_Room_Type 7          -1.4003      0.294     -4.770      0.000      -1.976      -0.825
market_segment_type_Complementary      -40.5975   5.65e+05  -7.19e-05      1.000   -1.11e+06    1.11e+06
market_segment_type_Corporate           -1.1924      0.266     -4.483      0.000      -1.714      -0.671
market_segment_type_Offline             -2.1946      0.255     -8.621      0.000      -2.694      -1.696
market_segment_type_Online              -0.3995      0.251     -1.590      0.112      -0.892       0.093
========================================================================================================
In [70]:
print("Training performance:")  # print this statement
model_performance_classification_statsmodels(
    lg, X_train, y_train
)  # show performance of model on training data set
Training performance:
Out[70]:
Accuracy Recall Precision F1
0 0.80600 0.63410 0.73971 0.68285
  • The evaluation results for the model on the training data are
  • Accuracy: 80.6%
  • Recall: 63.4%
  • Precision: 74%
  • F1 score: 68.3%
In [71]:
print("Training performance:")  # print statement
model_performance_classification_statsmodels(
    lg, X_test, y_test
)  # show performance of model on test data set
Training performance:
Out[71]:
Accuracy Recall Precision F1
0 0.80493 0.63260 0.72882 0.67731
  • The evaluation results for the model on the testing data are
  • Accuracy: 80.5%
  • Recall: 63.3%
  • Precision: 72.9%
  • F1 score: 67.7%

The model is performing well on the testing set.

FURTHER EDA¶

In [72]:
# create boxplot and histogram for numeric columns
for col in [
    "no_of_adults",
    "no_of_children",
    "no_of_weekend_nights",
    "no_of_week_nights",
    "lead_time",
    "no_of_previous_cancellations",
    "no_of_previous_bookings_not_canceled",
    "avg_price_per_room",
    "no_of_special_requests",
    "booking_status",
]:
    histogram_boxplot(data, col)  # create histogram and boxplot
In [73]:
for col in [
    "required_car_parking_space",
    "arrival_year",
    "arrival_month",
    "arrival_year",
    "repeated_guest",
    "type_of_meal_plan",
    "room_type_reserved",
    "market_segment_type",
]:
    labeled_barplot(data, col, perc=True)  # create bar graph for the above features
In [74]:
plt.figure(figsize=(12, 7))  # size parameters
sns.pairplot(data, hue="booking_status")
# pair plot for all variables in the dataset
Out[74]:
<seaborn.axisgrid.PairGrid at 0x28654f6b9a0>
<Figure size 864x504 with 0 Axes>

Dealing with Multicollinearity¶

In [75]:
# function to check VIF
def checking_vif(predictors):
    vif = pd.DataFrame()
    vif["feature"] = predictors.columns

    # calculating VIF for each feature
    vif["VIF"] = [
        variance_inflation_factor(predictors.values, i)
        for i in range(len(predictors.columns))
    ]
    return vif
In [76]:
checking_vif(X_train)  # display results of function
Out[76]:
feature VIF
0 const 39497686.20788
1 no_of_adults 1.35113
2 no_of_children 2.09358
3 no_of_weekend_nights 1.06948
4 no_of_week_nights 1.09571
5 required_car_parking_space 1.03997
6 lead_time 1.39517
7 arrival_year 1.43190
8 arrival_month 1.27633
9 arrival_date 1.00679
10 repeated_guest 1.78358
11 no_of_previous_cancellations 1.39569
12 no_of_previous_bookings_not_canceled 1.65200
13 avg_price_per_room 2.06860
14 no_of_special_requests 1.24798
15 type_of_meal_plan_Meal Plan 2 1.27328
16 type_of_meal_plan_Meal Plan 3 1.02526
17 type_of_meal_plan_Not Selected 1.27306
18 room_type_reserved_Room_Type 2 1.10595
19 room_type_reserved_Room_Type 3 1.00330
20 room_type_reserved_Room_Type 4 1.36361
21 room_type_reserved_Room_Type 5 1.02800
22 room_type_reserved_Room_Type 6 2.05614
23 room_type_reserved_Room_Type 7 1.11816
24 market_segment_type_Complementary 4.50276
25 market_segment_type_Corporate 16.92829
26 market_segment_type_Offline 64.11564
27 market_segment_type_Online 71.18026
  • All the variables show a VIF < 5, except for the dummy variables in market segment column. Hence the assumption of no multicollinearity is satisfied.

Dropping high p-values.¶

In [77]:
# initial list of columns
cols = X_train.columns.tolist()

# setting an initial max p-value
max_p_value = 1

while len(cols) > 0:
    # defining the train set
    x_train_aux = X_train[cols]

    # fitting the model
    model = sm.Logit(y_train, x_train_aux).fit(disp=False)

    # getting the p-values and the maximum p-value
    p_values = model.pvalues
    max_p_value = max(p_values)

    # name of the variable with maximum p-value
    feature_with_p_max = p_values.idxmax()

    if max_p_value > 0.05:
        cols.remove(feature_with_p_max)
    else:
        break

selected_features = cols
print(selected_features)  # print selected features to be used
['const', 'no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights', 'required_car_parking_space', 'lead_time', 'arrival_year', 'arrival_month', 'repeated_guest', 'no_of_previous_cancellations', 'avg_price_per_room', 'no_of_special_requests', 'type_of_meal_plan_Meal Plan 2', 'type_of_meal_plan_Not Selected', 'room_type_reserved_Room_Type 2', 'room_type_reserved_Room_Type 4', 'room_type_reserved_Room_Type 5', 'room_type_reserved_Room_Type 6', 'room_type_reserved_Room_Type 7', 'market_segment_type_Corporate', 'market_segment_type_Offline']
In [78]:
X_train1 = X_train[selected_features]  # selected features on train data set
X_test1 = X_test[selected_features]  # selected features on test data set
In [79]:
logit1 = sm.Logit(
    y_train, X_train1.astype(float)
)  # train logistic regression on X_train1 and y_train
lg1 = logit1.fit(disp=False)  # fit logistic regression
print(lg1.summary())  # print summary of the model
                           Logit Regression Results                           
==============================================================================
Dep. Variable:         booking_status   No. Observations:                25392
Model:                          Logit   Df Residuals:                    25370
Method:                           MLE   Df Model:                           21
Date:                Fri, 27 Jan 2023   Pseudo R-squ.:                  0.3282
Time:                        15:45:21   Log-Likelihood:                -10810.
converged:                       True   LL-Null:                       -16091.
Covariance Type:            nonrobust   LLR p-value:                     0.000
==================================================================================================
                                     coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
const                           -915.6391    120.471     -7.600      0.000   -1151.758    -679.520
no_of_adults                       0.1088      0.037      2.914      0.004       0.036       0.182
no_of_children                     0.1531      0.062      2.470      0.014       0.032       0.275
no_of_weekend_nights               0.1086      0.020      5.498      0.000       0.070       0.147
no_of_week_nights                  0.0417      0.012      3.399      0.001       0.018       0.066
required_car_parking_space        -1.5947      0.138    -11.564      0.000      -1.865      -1.324
lead_time                          0.0157      0.000     59.213      0.000       0.015       0.016
arrival_year                       0.4523      0.060      7.576      0.000       0.335       0.569
arrival_month                     -0.0425      0.006     -6.591      0.000      -0.055      -0.030
repeated_guest                    -2.7367      0.557     -4.916      0.000      -3.828      -1.646
no_of_previous_cancellations       0.2288      0.077      2.983      0.003       0.078       0.379
avg_price_per_room                 0.0192      0.001     26.336      0.000       0.018       0.021
no_of_special_requests            -1.4698      0.030    -48.884      0.000      -1.529      -1.411
type_of_meal_plan_Meal Plan 2      0.1642      0.067      2.469      0.014       0.034       0.295
type_of_meal_plan_Not Selected     0.2860      0.053      5.406      0.000       0.182       0.390
room_type_reserved_Room_Type 2    -0.3552      0.131     -2.709      0.007      -0.612      -0.098
room_type_reserved_Room_Type 4    -0.2828      0.053     -5.330      0.000      -0.387      -0.179
room_type_reserved_Room_Type 5    -0.7364      0.208     -3.535      0.000      -1.145      -0.328
room_type_reserved_Room_Type 6    -0.9682      0.151     -6.403      0.000      -1.265      -0.672
room_type_reserved_Room_Type 7    -1.4343      0.293     -4.892      0.000      -2.009      -0.860
market_segment_type_Corporate     -0.7913      0.103     -7.692      0.000      -0.993      -0.590
market_segment_type_Offline       -1.7854      0.052    -34.363      0.000      -1.887      -1.684
==================================================================================================
In [80]:
print("Training performance:")  # print statement
model_performance_classification_statsmodels(
    lg1, X_train1, y_train
)  # check performance on X_train1 and y_train
Training performance:
Out[80]:
Accuracy Recall Precision F1
0 0.80545 0.63267 0.73907 0.68174
  • All variables with p-values greater than 0.05 have been removed so we can consider these as our final variables.
  • The performance metrics have not changed after removing the high p-value variables so we can confirm that they do not affect the model significantly.
  • Some coefficients have positive values whilst others have negative values. This means that, an increase in the variables with negative coefficient values will decrease the probability of a booking being canceled and an increase in the variables with positive coefficient will increase the chances of a booking being canceled.

Converting coefficient to odds¶

In [81]:
# converting coefficients to odds
odds = np.exp(lg1.params)

# finding the percentage change
perc_change_odds = (np.exp(lg1.params) - 1) * 100

# removing limit from number of columns to display
pd.set_option("display.max_columns", None)

# adding the odds to a dataframe
pd.DataFrame({"Odds": odds, "Change_odd%": perc_change_odds}, index=X_train1.columns)
Out[81]:
Odds Change_odd%
const 0.00000 -100.00000
no_of_adults 1.11491 11.49096
no_of_children 1.16546 16.54593
no_of_weekend_nights 1.11470 11.46966
no_of_week_nights 1.04258 4.25841
required_car_parking_space 0.20296 -79.70395
lead_time 1.01583 1.58331
arrival_year 1.57195 57.19508
arrival_month 0.95839 -4.16120
repeated_guest 0.06478 -93.52180
no_of_previous_cancellations 1.25712 25.71181
avg_price_per_room 1.01937 1.93684
no_of_special_requests 0.22996 -77.00374
type_of_meal_plan_Meal Plan 2 1.17846 17.84641
type_of_meal_plan_Not Selected 1.33109 33.10947
room_type_reserved_Room_Type 2 0.70104 -29.89588
room_type_reserved_Room_Type 4 0.75364 -24.63551
room_type_reserved_Room_Type 5 0.47885 -52.11548
room_type_reserved_Room_Type 6 0.37977 -62.02290
room_type_reserved_Room_Type 7 0.23827 -76.17294
market_segment_type_Corporate 0.45326 -54.67373
market_segment_type_Offline 0.16773 -83.22724
  • Variables that have odds ratio greater than 1 are more likely to cause the cancellation of a booking. Besides arrival year, type of meal plan not selected has the highest odds ratio hence has a strong association with the outcome of the event.

Performance on training set¶

In [82]:
# creating confusion matrix
confusion_matrix_statsmodels(lg1, X_train1, y_train)
In [83]:
print("Training performance:")
log_reg_model_train_perf = model_performance_classification_statsmodels(
    lg1, X_train1, y_train
)  # performance on X_train1 and y_train
log_reg_model_train_perf
Training performance:
Out[83]:
Accuracy Recall Precision F1
0 0.80545 0.63267 0.73907 0.68174
  • The model is doing well since the values of our performance metrics have not changed significantly

ROC-AUC¶

In [84]:
# parameters to plot roc-auc graph
logit_roc_auc_train = roc_auc_score(y_train, lg1.predict(X_train1))
fpr, tpr, thresholds = roc_curve(y_train, lg1.predict(X_train1))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.01])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
  • The graph shows that the model is doing well on the training set.

Improving the model by changing threshold value for AUC-ROC curve.¶

In [85]:
# Optimal threshold as per AUC-ROC curve
# The optimal cut off would be where tpr is high and fpr is low
fpr, tpr, thresholds = roc_curve(y_train, lg1.predict(X_train1))

optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print(optimal_threshold_auc_roc)  # show optimal threshold value
0.3700522558707844
  • The optimal threshold value is 0.37
In [86]:
# creating confusion matrix
confusion_matrix_statsmodels(
    lg1, X_train1, y_train, threshold=optimal_threshold_auc_roc
)  # confusion matrix for X_train1 and y_train with optimal_threshold_auc_roc as threshold
In [87]:
# checking model performance for this model
log_reg_model_train_perf_threshold_auc_roc = model_performance_classification_statsmodels(
    lg1, X_train1, y_train, threshold=optimal_threshold_auc_roc
)
print("Training performance:")
log_reg_model_train_perf_threshold_auc_roc  # performance on train data set
Training performance:
Out[87]:
Accuracy Recall Precision F1
0 0.79265 0.73622 0.66808 0.70049
  • Accuracy and Precision reduced slightly with the given optimal threshold. However, Recall and the F1 score increased so can be used for the purpose of this analysis.

Using the Precision-Recall to check for better optimal threshold¶

In [88]:
y_scores = lg1.predict(X_train1)
prec, rec, tre = precision_recall_curve(
    y_train, y_scores,
)  # axes for precision-recall curve

# parameters to draw curve
def plot_prec_recall_vs_tresh(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="precision")
    plt.plot(thresholds, recalls[:-1], "g--", label="recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])


plt.figure(figsize=(10, 7))
plot_prec_recall_vs_tresh(prec, rec, tre)
plt.show()
  • The graph shows an optimal threshold of about 0.42
In [89]:
# setting the threshold
optimal_threshold_curve = 0.42

Model performance on Training set¶

In [90]:
# creating confusion matrix
confusion_matrix_statsmodels(
    lg1, X_train1, y_train, threshold=optimal_threshold_curve,
)  # confusion matrix for X_train1 and y_train with optimal_threshold_curve as threshold
In [91]:
log_reg_model_train_perf_threshold_curve = model_performance_classification_statsmodels(
    lg1, X_train1, y_train, threshold=optimal_threshold_curve
)
print("Training performance:")
log_reg_model_train_perf_threshold_curve  # performance on train data set
Training performance:
Out[91]:
Accuracy Recall Precision F1
0 0.80132 0.69939 0.69797 0.69868
  • The performance values of our metrics have no shown significant changes so we can assume the model is still doing well.

Comparing training performance of different thresholds¶

In [92]:
# training performance comparison

models_train_comp_df = pd.concat(
    [
        log_reg_model_train_perf.T,
        log_reg_model_train_perf_threshold_auc_roc.T,
        log_reg_model_train_perf_threshold_curve.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Logistic Regression sklearn",
    "Logistic Regression-0.38 Threshold",
    "Logistic Regression-0.42 Threshold",
]

print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[92]:
Logistic Regression sklearn Logistic Regression-0.38 Threshold Logistic Regression-0.42 Threshold
Accuracy 0.80545 0.79265 0.80132
Recall 0.63267 0.73622 0.69939
Precision 0.73907 0.66808 0.69797
F1 0.68174 0.70049 0.69868

Performance on test data¶

Using default threshold value¶

In [93]:
# creating confusion matrix
confusion_matrix_statsmodels(
    lg1, X_test1, y_test
)  # Complete the code to create confusion matrix for X_test1 and y_test
In [94]:
log_reg_model_test_perf = model_performance_classification_statsmodels(
    lg1, X_test1, y_test, threshold=optimal_threshold_auc_roc
)  # check performance on X_test1 and y_test

print("Test performance:")  # print statement
log_reg_model_test_perf  # performance on test data
Test performance:
Out[94]:
Accuracy Recall Precision F1
0 0.79555 0.73964 0.66573 0.70074
  • The values of the performance metrics have still not changed significantly on the test data so we can confirm a good performance on the test data.

ROC curve on test set¶

In [95]:
# parameters to draw curve
logit_roc_auc_train = roc_auc_score(y_test, lg1.predict(X_test1))
fpr, tpr, thresholds = roc_curve(y_test, lg1.predict(X_test1))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.01])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()

Using a model with threshold = 0.38¶

In [96]:
# creating confusion matrix
confusion_matrix_statsmodels(
    lg1, X_test1, y_test, threshold=optimal_threshold_auc_roc
)  # create confusion matrix for X_test1 and y_test using optimal_threshold_auc_roc as threshold
In [97]:
# checking model performance for this model
log_reg_model_test_perf_threshold_auc_roc = model_performance_classification_statsmodels(
    lg1, X_test1, y_test, threshold=optimal_threshold_auc_roc
)
print("Test performance:")
log_reg_model_test_perf_threshold_auc_roc  # performance on test data
Test performance:
Out[97]:
Accuracy Recall Precision F1
0 0.79555 0.73964 0.66573 0.70074
  • No significant changes in the performance metric values.

Using model with threshold=0.42

In [98]:
# creating confusion matrix
confusion_matrix_statsmodels(
    lg1, X_test1, y_test, threshold=optimal_threshold_curve
)  # create confusion matrix for X_test1 and y_test using optimal_threshold_curve as threshold
In [99]:
log_reg_model_test_perf_threshold_curve = model_performance_classification_statsmodels(
    lg1, X_test1, y_test, threshold=optimal_threshold_curve
)
print("Test performance:")
log_reg_model_test_perf_threshold_curve  # performance on test data
Test performance:
Out[99]:
Accuracy Recall Precision F1
0 0.80345 0.70358 0.69353 0.69852
  • A 0.42 threshold value caused an increase in Accuracy and Precision while reducing the values for recall and F1 score. The same observation was made on the training data set

Model Performance Summary

In [100]:
# training performance comparison

models_train_comp_df = pd.concat(
    [
        log_reg_model_train_perf.T,
        log_reg_model_train_perf_threshold_auc_roc.T,
        log_reg_model_train_perf_threshold_curve.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Logistic Regression-default Threshold",
    "Logistic Regression-0.37 Threshold",
    "Logistic Regression-0.42 Threshold",
]

print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[100]:
Logistic Regression-default Threshold Logistic Regression-0.37 Threshold Logistic Regression-0.42 Threshold
Accuracy 0.80545 0.79265 0.80132
Recall 0.63267 0.73622 0.69939
Precision 0.73907 0.66808 0.69797
F1 0.68174 0.70049 0.69868
In [101]:
# testing performance comparison

models_test_comp_df = pd.concat(
    [
        log_reg_model_test_perf.T,
        log_reg_model_test_perf_threshold_auc_roc.T,
        log_reg_model_test_perf_threshold_curve.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Logistic Regression-default Threshold",
    "Logistic Regression-0.37 Threshold",
    "Logistic Regression-0.42 Threshold",
]

print("Testing performance comparison:")
models_test_comp_df
Testing performance comparison:
Out[101]:
Logistic Regression-default Threshold Logistic Regression-0.37 Threshold Logistic Regression-0.42 Threshold
Accuracy 0.79555 0.79555 0.80345
Recall 0.73964 0.73964 0.70358
Precision 0.66573 0.66573 0.69353
F1 0.70074 0.70074 0.69852
  • The model performs well on the test data set as it does on the training data set.
  • An F1 score of about 0.7 means that the model is about 70% capable of capturing positive cases (recall) and be accurate with the cases it does capture (precision).
  • The default threshold value gave the best F1 score and recall while the 0.42 threshold value gave the best accuracy and precision value.

Decision Tree¶

In [102]:
X = data.drop(["booking_status"], axis=1)  # drop dependent variable from dataset
Y = data["booking_status"]  # assign variable Y to dependent variable

X = pd.get_dummies(X, drop_first=True)  # create dummies for X

# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1
)  # split the data into train test in the ratio 70:30 with random_state = 1
In [103]:
print(
    "Shape of Training set : ", X_train.shape
)  # number of rows and columns of training set
print("Shape of test set : ", X_test.shape)  # number of rows and columns of testing set
print("Percentage of classes in training set:")  # print statement
print(y_train.value_counts(normalize=True))  # ratio of both classes in training set
print("Percentage of classes in test set:")  # print statement
print(y_test.value_counts(normalize=True))  # ratio of both classes in test set
Shape of Training set :  (25392, 27)
Shape of test set :  (10883, 27)
Percentage of classes in training set:
0   0.67064
1   0.32936
Name: booking_status, dtype: float64
Percentage of classes in test set:
0   0.67638
1   0.32362
Name: booking_status, dtype: float64
  • There are 25392 and 10883 rows respectively in the training and test set and 27 columns in both classes.
In [104]:
X_train.head()  # first five rows of training data set
Out[104]:
no_of_adults no_of_children no_of_weekend_nights no_of_week_nights required_car_parking_space lead_time arrival_year arrival_month arrival_date repeated_guest no_of_previous_cancellations no_of_previous_bookings_not_canceled avg_price_per_room no_of_special_requests type_of_meal_plan_Meal Plan 2 type_of_meal_plan_Meal Plan 3 type_of_meal_plan_Not Selected room_type_reserved_Room_Type 2 room_type_reserved_Room_Type 3 room_type_reserved_Room_Type 4 room_type_reserved_Room_Type 5 room_type_reserved_Room_Type 6 room_type_reserved_Room_Type 7 market_segment_type_Complementary market_segment_type_Corporate market_segment_type_Offline market_segment_type_Online
13662 1 0 0 1 0 163 2018 10 15 0 0 0 115.00000 0 0 0 0 0 0 0 0 0 0 0 0 1 0
26641 2 0 0 3 0 113 2018 3 31 0 0 0 78.15000 1 0 0 0 1 0 0 0 0 0 0 0 0 1
17835 2 0 2 3 0 359 2018 10 14 0 0 0 78.00000 1 0 0 0 0 0 0 0 0 0 0 0 1 0
21485 2 0 0 3 0 136 2018 6 29 0 0 0 85.50000 0 0 0 1 0 0 0 0 0 0 0 0 0 1
5670 2 0 1 2 0 21 2018 8 15 0 0 0 151.00000 0 0 0 0 0 0 0 0 0 0 0 0 0 1
In [105]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
In [106]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

Building Decision Tree Model¶

In [107]:
model = DecisionTreeClassifier(random_state=1)  # create decision tree
model.fit(X_train, y_train)  ## fit decision tree on train data
Out[107]:
DecisionTreeClassifier(random_state=1)
In [108]:
confusion_matrix_sklearn(model, X_train, y_train)  # create confusion matrix
In [109]:
decision_tree_perf_train_without = model_performance_classification_sklearn(
    model, X_train, y_train
)  # performance of decison tree on training data
decision_tree_perf_train_without
Out[109]:
Accuracy Recall Precision F1
0 0.99421 0.98661 0.99578 0.99117
  • The performance metrics show some very high values on the training data set indicating a good classification of the data points in the training data set.

Model performance on test set.¶

In [110]:
confusion_matrix_sklearn(model, X_test, y_test)  # create confusion matrix
In [111]:
decision_tree_perf_test_without = model_performance_classification_sklearn(
    model, X_test, y_test
)  # performance of decison tree on test data
decision_tree_perf_test_without
Out[111]:
Accuracy Recall Precision F1
0 0.87118 0.81175 0.79461 0.80309
  • The model performs well on the train data but does not perform as well on the test data as the values have significantly reduced.

Decision Tree (with class_weights)¶

In [112]:
model = DecisionTreeClassifier(random_state=1, class_weight="balanced")
model.fit(X_train, y_train)  # fit model on train data using the above set parameters
Out[112]:
DecisionTreeClassifier(class_weight='balanced', random_state=1)
In [113]:
confusion_matrix_sklearn(model, X_train, y_train)
In [114]:
decision_tree_perf_train = model_performance_classification_sklearn(
    model, X_train, y_train  # performance on train data
)
decision_tree_perf_train
Out[114]:
Accuracy Recall Precision F1
0 0.99311 0.99510 0.98415 0.98960
In [115]:
confusion_matrix_sklearn(model, X_test, y_test)  # create confusion matrix
In [116]:
decision_tree_perf_test = model_performance_classification_sklearn(
    model, X_test, y_test  # performance on test data
)
decision_tree_perf_test
Out[116]:
Accuracy Recall Precision F1
0 0.86621 0.80494 0.78663 0.79568
  • Model performed well on both train sets with very little error but showed huge disparity in the test set which could hint overfitting on the test data set.

Important features¶

In [117]:
feature_names = list(X_train.columns)  # train data columns
importances = model.feature_importances_  # show their relative importances
indices = np.argsort(importances)

# parameters to draw graph
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
  • Lead time and the average price per room are significantly the most important features in the model.
  • Market segment online, arrival date, number of special requests, arrival month, number of week nights, number of weekend nights, number of adults follow respectively but with relatively muted values.

Pre-pruning¶

In [118]:
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1, class_weight="balanced")

# Grid of parameters to choose from
parameters = {
    "max_depth": np.arange(2, 7, 2),
    "max_leaf_nodes": [50, 75, 150, 250],
    "min_samples_split": [10, 30, 50, 70],
}

# Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(f1_score)

# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_

# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
Out[118]:
DecisionTreeClassifier(class_weight='balanced', max_depth=6, max_leaf_nodes=50,
                       min_samples_split=10, random_state=1)
In [119]:
confusion_matrix_sklearn(estimator, X_train, y_train)  # create confusion matrix
In [120]:
decision_tree_tune_perf_train = model_performance_classification_sklearn(
    estimator, X_train, y_train
)  # performance on train set
decision_tree_tune_perf_train
Out[120]:
Accuracy Recall Precision F1
0 0.83097 0.78608 0.72425 0.75390

Performance on test data¶

In [121]:
confusion_matrix_sklearn(estimator, X_test, y_test)  # create confusion matrix
In [122]:
decision_tree_tune_perf_test = model_performance_classification_sklearn(
    estimator, X_test, y_test
)  # performance on test set
decision_tree_tune_perf_test
Out[122]:
Accuracy Recall Precision F1
0 0.83497 0.78336 0.72758 0.75444
  • The model performs well on the train and test set as the show similar results on their performance metrics and their values are close to 1.

Visualizing the decision tree¶

In [123]:
feature_names = list(X_train.columns)  # train data columns
importances = estimator.feature_importances_
indices = np.argsort(importances)
In [124]:
# parameters to draw decision tree
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
    estimator,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
In [125]:
# Text report showing the rules of a decision tree -
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
|--- lead_time <= 151.50
|   |--- no_of_special_requests <= 0.50
|   |   |--- market_segment_type_Online <= 0.50
|   |   |   |--- lead_time <= 90.50
|   |   |   |   |--- no_of_weekend_nights <= 0.50
|   |   |   |   |   |--- avg_price_per_room <= 196.50
|   |   |   |   |   |   |--- weights: [1736.39, 133.59] class: 0
|   |   |   |   |   |--- avg_price_per_room >  196.50
|   |   |   |   |   |   |--- weights: [0.75, 24.29] class: 1
|   |   |   |   |--- no_of_weekend_nights >  0.50
|   |   |   |   |   |--- lead_time <= 68.50
|   |   |   |   |   |   |--- weights: [960.27, 223.16] class: 0
|   |   |   |   |   |--- lead_time >  68.50
|   |   |   |   |   |   |--- weights: [129.73, 160.92] class: 1
|   |   |   |--- lead_time >  90.50
|   |   |   |   |--- lead_time <= 117.50
|   |   |   |   |   |--- avg_price_per_room <= 93.58
|   |   |   |   |   |   |--- weights: [214.72, 227.72] class: 1
|   |   |   |   |   |--- avg_price_per_room >  93.58
|   |   |   |   |   |   |--- weights: [82.76, 285.41] class: 1
|   |   |   |   |--- lead_time >  117.50
|   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |--- weights: [87.23, 81.98] class: 0
|   |   |   |   |   |--- no_of_week_nights >  1.50
|   |   |   |   |   |   |--- weights: [228.14, 48.58] class: 0
|   |   |--- market_segment_type_Online >  0.50
|   |   |   |--- lead_time <= 13.50
|   |   |   |   |--- avg_price_per_room <= 99.44
|   |   |   |   |   |--- arrival_month <= 1.50
|   |   |   |   |   |   |--- weights: [92.45, 0.00] class: 0
|   |   |   |   |   |--- arrival_month >  1.50
|   |   |   |   |   |   |--- weights: [363.83, 132.08] class: 0
|   |   |   |   |--- avg_price_per_room >  99.44
|   |   |   |   |   |--- lead_time <= 3.50
|   |   |   |   |   |   |--- weights: [219.94, 85.01] class: 0
|   |   |   |   |   |--- lead_time >  3.50
|   |   |   |   |   |   |--- weights: [132.71, 280.85] class: 1
|   |   |   |--- lead_time >  13.50
|   |   |   |   |--- required_car_parking_space <= 0.50
|   |   |   |   |   |--- avg_price_per_room <= 71.92
|   |   |   |   |   |   |--- weights: [158.80, 159.40] class: 1
|   |   |   |   |   |--- avg_price_per_room >  71.92
|   |   |   |   |   |   |--- weights: [850.67, 3543.28] class: 1
|   |   |   |   |--- required_car_parking_space >  0.50
|   |   |   |   |   |--- weights: [48.46, 1.52] class: 0
|   |--- no_of_special_requests >  0.50
|   |   |--- no_of_special_requests <= 1.50
|   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |--- lead_time <= 102.50
|   |   |   |   |   |--- type_of_meal_plan_Not Selected <= 0.50
|   |   |   |   |   |   |--- weights: [697.09, 9.11] class: 0
|   |   |   |   |   |--- type_of_meal_plan_Not Selected >  0.50
|   |   |   |   |   |   |--- weights: [15.66, 9.11] class: 0
|   |   |   |   |--- lead_time >  102.50
|   |   |   |   |   |--- no_of_week_nights <= 2.50
|   |   |   |   |   |   |--- weights: [32.06, 19.74] class: 0
|   |   |   |   |   |--- no_of_week_nights >  2.50
|   |   |   |   |   |   |--- weights: [44.73, 3.04] class: 0
|   |   |   |--- market_segment_type_Online >  0.50
|   |   |   |   |--- lead_time <= 8.50
|   |   |   |   |   |--- lead_time <= 4.50
|   |   |   |   |   |   |--- weights: [498.03, 44.03] class: 0
|   |   |   |   |   |--- lead_time >  4.50
|   |   |   |   |   |   |--- weights: [258.71, 63.76] class: 0
|   |   |   |   |--- lead_time >  8.50
|   |   |   |   |   |--- required_car_parking_space <= 0.50
|   |   |   |   |   |   |--- weights: [2512.51, 1451.32] class: 0
|   |   |   |   |   |--- required_car_parking_space >  0.50
|   |   |   |   |   |   |--- weights: [134.20, 1.52] class: 0
|   |   |--- no_of_special_requests >  1.50
|   |   |   |--- lead_time <= 90.50
|   |   |   |   |--- no_of_week_nights <= 3.50
|   |   |   |   |   |--- weights: [1585.04, 0.00] class: 0
|   |   |   |   |--- no_of_week_nights >  3.50
|   |   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |   |--- weights: [180.42, 57.69] class: 0
|   |   |   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |   |   |--- weights: [52.19, 0.00] class: 0
|   |   |   |--- lead_time >  90.50
|   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |--- arrival_month <= 8.50
|   |   |   |   |   |   |--- weights: [184.90, 56.17] class: 0
|   |   |   |   |   |--- arrival_month >  8.50
|   |   |   |   |   |   |--- weights: [106.61, 106.27] class: 0
|   |   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |   |--- weights: [67.10, 0.00] class: 0
|--- lead_time >  151.50
|   |--- avg_price_per_room <= 100.04
|   |   |--- no_of_special_requests <= 0.50
|   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |   |--- lead_time <= 163.50
|   |   |   |   |   |   |--- weights: [3.73, 24.29] class: 1
|   |   |   |   |   |--- lead_time >  163.50
|   |   |   |   |   |   |--- weights: [257.96, 62.24] class: 0
|   |   |   |   |--- market_segment_type_Online >  0.50
|   |   |   |   |   |--- avg_price_per_room <= 2.50
|   |   |   |   |   |   |--- weights: [8.95, 3.04] class: 0
|   |   |   |   |   |--- avg_price_per_room >  2.50
|   |   |   |   |   |   |--- weights: [0.75, 97.16] class: 1
|   |   |   |--- no_of_adults >  1.50
|   |   |   |   |--- avg_price_per_room <= 82.47
|   |   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |   |--- weights: [2.98, 282.37] class: 1
|   |   |   |   |   |--- market_segment_type_Offline >  0.50
|   |   |   |   |   |   |--- weights: [213.97, 385.60] class: 1
|   |   |   |   |--- avg_price_per_room >  82.47
|   |   |   |   |   |--- no_of_adults <= 2.50
|   |   |   |   |   |   |--- weights: [23.86, 1030.80] class: 1
|   |   |   |   |   |--- no_of_adults >  2.50
|   |   |   |   |   |   |--- weights: [5.22, 0.00] class: 0
|   |   |--- no_of_special_requests >  0.50
|   |   |   |--- no_of_weekend_nights <= 0.50
|   |   |   |   |--- lead_time <= 180.50
|   |   |   |   |   |--- lead_time <= 159.50
|   |   |   |   |   |   |--- weights: [7.46, 7.59] class: 1
|   |   |   |   |   |--- lead_time >  159.50
|   |   |   |   |   |   |--- weights: [37.28, 4.55] class: 0
|   |   |   |   |--- lead_time >  180.50
|   |   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |   |--- weights: [20.13, 212.54] class: 1
|   |   |   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |   |   |--- weights: [8.95, 0.00] class: 0
|   |   |   |--- no_of_weekend_nights >  0.50
|   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |--- weights: [231.12, 110.82] class: 0
|   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |--- weights: [19.38, 34.92] class: 1
|   |   |   |   |--- market_segment_type_Offline >  0.50
|   |   |   |   |   |--- lead_time <= 348.50
|   |   |   |   |   |   |--- weights: [106.61, 3.04] class: 0
|   |   |   |   |   |--- lead_time >  348.50
|   |   |   |   |   |   |--- weights: [5.96, 4.55] class: 0
|   |--- avg_price_per_room >  100.04
|   |   |--- arrival_month <= 11.50
|   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |--- weights: [0.00, 3200.19] class: 1
|   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |--- weights: [23.11, 0.00] class: 0
|   |   |--- arrival_month >  11.50
|   |   |   |--- no_of_special_requests <= 0.50
|   |   |   |   |--- weights: [35.04, 0.00] class: 0
|   |   |   |--- no_of_special_requests >  0.50
|   |   |   |   |--- arrival_date <= 24.50
|   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |--- arrival_date >  24.50
|   |   |   |   |   |--- weights: [3.73, 22.77] class: 1

  • The following interpretations can be made using the following extraction decision rules:
  • If lead time is less than or equal to 151.50, number of special requests less than 0.50, market segment type online less than 0.50, lead time less than 90.50,number of weekend nights less than 0.50 and average price per room is greater than 196.50, then the booking is likely to get canceled by the client.
  • We can make other interpretations using the other decision rules.
In [126]:
# importance of features in the tree building

importances = estimator.feature_importances_
indices = np.argsort(importances)

# graph parameters
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
  • In the pre-pruned tree, Lead time, market segment type online, number of special requests, average price per room, number of adults, number of weekend nights, arrival month, required car space, market segment type offline and number of week nights are the features with some importance in our model.
In [127]:
clf = DecisionTreeClassifier(
    random_state=1, class_weight="balanced"
)  # parameters for setting the cost complexity path
path = clf.cost_complexity_pruning_path(
    X_train, y_train
)  # cost complexity on train data
ccp_alphas, impurities = abs(path.ccp_alphas), path.impurities
In [128]:
pd.DataFrame(path)  # output of cost complexity computation
Out[128]:
ccp_alphas impurities
0 0.00000 0.00838
1 0.00000 0.00838
2 0.00000 0.00838
3 0.00000 0.00838
4 0.00000 0.00838
... ... ...
1839 0.00890 0.32806
1840 0.00980 0.33786
1841 0.01272 0.35058
1842 0.03412 0.41882
1843 0.08118 0.50000

1844 rows × 2 columns

In [129]:
# parameters to draw graph for total impurity vs effective alpha for training data
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()

Now, we train a decision tree using effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.

In [130]:
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(
        random_state=1, ccp_alpha=ccp_alpha, class_weight="balanced"
    )
    clf.fit(X_train, y_train)  # fit decision tree on training data
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.0811791438913696
In [131]:
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()

F1 Score vs alpha for training and testing sets¶

In [132]:
f1_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = f1_score(y_train, pred_train)
    f1_train.append(values_train)

f1_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = f1_score(y_test, pred_test)
    f1_test.append(values_test)
In [133]:
fig, ax = plt.subplots(figsize=(15, 5))  # size configurations
ax.set_xlabel("alpha")  # label of x-axis
ax.set_ylabel("F1 Score")  # label of y axis
ax.set_title("F1 Score vs alpha for training and testing sets")  # title of plot
ax.plot(
    ccp_alphas, f1_train, marker="o", label="train", drawstyle="steps-post"
)  # parameters for plot on train data
ax.plot(
    ccp_alphas, f1_test, marker="o", label="test", drawstyle="steps-post"
)  # parameters for plot on test data
ax.legend()  # plot legend in the graph
plt.show()  # show graph
In [134]:
# Create model to obtain the highest train and test recall
index_best_model = np.argmax(f1_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.00012267633155167043,
                       class_weight='balanced', random_state=1)

Checking performance on training set¶

In [135]:
confusion_matrix_sklearn(best_model, X_train, y_train)  # create confusion matrix
In [136]:
decision_tree_post_perf_train = model_performance_classification_sklearn(
    best_model, X_train, y_train  # performance of decision tree on train data
)
decision_tree_post_perf_train  # show results
Out[136]:
Accuracy Recall Precision F1
0 0.89954 0.90303 0.81274 0.85551

Checking performance on test set¶

In [137]:
confusion_matrix_sklearn(best_model, X_test, y_test)  # create confusion matrix
In [138]:
decision_tree_post_test = model_performance_classification_sklearn(
    best_model, X_test, y_test
)  # performance of decision tree on test data
decision_tree_post_test  # show results after the pruning
Out[138]:
Accuracy Recall Precision F1
0 0.86879 0.85576 0.76614 0.80848
  • The model is giving generalized results on both data sets as seen indicating good performance on unseen data.

Decision Tree diagram¶

In [139]:
feature_names = list(X_train.columns)
importances = estimator.feature_importances_
indices = np.argsort(importances)
In [140]:
# parameters to show decision tree diagram
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
    estimator,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
In [141]:
# Text report showing the rules of a decision tree -
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
|--- lead_time <= 151.50
|   |--- no_of_special_requests <= 0.50
|   |   |--- market_segment_type_Online <= 0.50
|   |   |   |--- lead_time <= 90.50
|   |   |   |   |--- no_of_weekend_nights <= 0.50
|   |   |   |   |   |--- avg_price_per_room <= 196.50
|   |   |   |   |   |   |--- weights: [1736.39, 133.59] class: 0
|   |   |   |   |   |--- avg_price_per_room >  196.50
|   |   |   |   |   |   |--- weights: [0.75, 24.29] class: 1
|   |   |   |   |--- no_of_weekend_nights >  0.50
|   |   |   |   |   |--- lead_time <= 68.50
|   |   |   |   |   |   |--- weights: [960.27, 223.16] class: 0
|   |   |   |   |   |--- lead_time >  68.50
|   |   |   |   |   |   |--- weights: [129.73, 160.92] class: 1
|   |   |   |--- lead_time >  90.50
|   |   |   |   |--- lead_time <= 117.50
|   |   |   |   |   |--- avg_price_per_room <= 93.58
|   |   |   |   |   |   |--- weights: [214.72, 227.72] class: 1
|   |   |   |   |   |--- avg_price_per_room >  93.58
|   |   |   |   |   |   |--- weights: [82.76, 285.41] class: 1
|   |   |   |   |--- lead_time >  117.50
|   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |--- weights: [87.23, 81.98] class: 0
|   |   |   |   |   |--- no_of_week_nights >  1.50
|   |   |   |   |   |   |--- weights: [228.14, 48.58] class: 0
|   |   |--- market_segment_type_Online >  0.50
|   |   |   |--- lead_time <= 13.50
|   |   |   |   |--- avg_price_per_room <= 99.44
|   |   |   |   |   |--- arrival_month <= 1.50
|   |   |   |   |   |   |--- weights: [92.45, 0.00] class: 0
|   |   |   |   |   |--- arrival_month >  1.50
|   |   |   |   |   |   |--- weights: [363.83, 132.08] class: 0
|   |   |   |   |--- avg_price_per_room >  99.44
|   |   |   |   |   |--- lead_time <= 3.50
|   |   |   |   |   |   |--- weights: [219.94, 85.01] class: 0
|   |   |   |   |   |--- lead_time >  3.50
|   |   |   |   |   |   |--- weights: [132.71, 280.85] class: 1
|   |   |   |--- lead_time >  13.50
|   |   |   |   |--- required_car_parking_space <= 0.50
|   |   |   |   |   |--- avg_price_per_room <= 71.92
|   |   |   |   |   |   |--- weights: [158.80, 159.40] class: 1
|   |   |   |   |   |--- avg_price_per_room >  71.92
|   |   |   |   |   |   |--- weights: [850.67, 3543.28] class: 1
|   |   |   |   |--- required_car_parking_space >  0.50
|   |   |   |   |   |--- weights: [48.46, 1.52] class: 0
|   |--- no_of_special_requests >  0.50
|   |   |--- no_of_special_requests <= 1.50
|   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |--- lead_time <= 102.50
|   |   |   |   |   |--- type_of_meal_plan_Not Selected <= 0.50
|   |   |   |   |   |   |--- weights: [697.09, 9.11] class: 0
|   |   |   |   |   |--- type_of_meal_plan_Not Selected >  0.50
|   |   |   |   |   |   |--- weights: [15.66, 9.11] class: 0
|   |   |   |   |--- lead_time >  102.50
|   |   |   |   |   |--- no_of_week_nights <= 2.50
|   |   |   |   |   |   |--- weights: [32.06, 19.74] class: 0
|   |   |   |   |   |--- no_of_week_nights >  2.50
|   |   |   |   |   |   |--- weights: [44.73, 3.04] class: 0
|   |   |   |--- market_segment_type_Online >  0.50
|   |   |   |   |--- lead_time <= 8.50
|   |   |   |   |   |--- lead_time <= 4.50
|   |   |   |   |   |   |--- weights: [498.03, 44.03] class: 0
|   |   |   |   |   |--- lead_time >  4.50
|   |   |   |   |   |   |--- weights: [258.71, 63.76] class: 0
|   |   |   |   |--- lead_time >  8.50
|   |   |   |   |   |--- required_car_parking_space <= 0.50
|   |   |   |   |   |   |--- weights: [2512.51, 1451.32] class: 0
|   |   |   |   |   |--- required_car_parking_space >  0.50
|   |   |   |   |   |   |--- weights: [134.20, 1.52] class: 0
|   |   |--- no_of_special_requests >  1.50
|   |   |   |--- lead_time <= 90.50
|   |   |   |   |--- no_of_week_nights <= 3.50
|   |   |   |   |   |--- weights: [1585.04, 0.00] class: 0
|   |   |   |   |--- no_of_week_nights >  3.50
|   |   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |   |--- weights: [180.42, 57.69] class: 0
|   |   |   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |   |   |--- weights: [52.19, 0.00] class: 0
|   |   |   |--- lead_time >  90.50
|   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |--- arrival_month <= 8.50
|   |   |   |   |   |   |--- weights: [184.90, 56.17] class: 0
|   |   |   |   |   |--- arrival_month >  8.50
|   |   |   |   |   |   |--- weights: [106.61, 106.27] class: 0
|   |   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |   |--- weights: [67.10, 0.00] class: 0
|--- lead_time >  151.50
|   |--- avg_price_per_room <= 100.04
|   |   |--- no_of_special_requests <= 0.50
|   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |   |--- lead_time <= 163.50
|   |   |   |   |   |   |--- weights: [3.73, 24.29] class: 1
|   |   |   |   |   |--- lead_time >  163.50
|   |   |   |   |   |   |--- weights: [257.96, 62.24] class: 0
|   |   |   |   |--- market_segment_type_Online >  0.50
|   |   |   |   |   |--- avg_price_per_room <= 2.50
|   |   |   |   |   |   |--- weights: [8.95, 3.04] class: 0
|   |   |   |   |   |--- avg_price_per_room >  2.50
|   |   |   |   |   |   |--- weights: [0.75, 97.16] class: 1
|   |   |   |--- no_of_adults >  1.50
|   |   |   |   |--- avg_price_per_room <= 82.47
|   |   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |   |--- weights: [2.98, 282.37] class: 1
|   |   |   |   |   |--- market_segment_type_Offline >  0.50
|   |   |   |   |   |   |--- weights: [213.97, 385.60] class: 1
|   |   |   |   |--- avg_price_per_room >  82.47
|   |   |   |   |   |--- no_of_adults <= 2.50
|   |   |   |   |   |   |--- weights: [23.86, 1030.80] class: 1
|   |   |   |   |   |--- no_of_adults >  2.50
|   |   |   |   |   |   |--- weights: [5.22, 0.00] class: 0
|   |   |--- no_of_special_requests >  0.50
|   |   |   |--- no_of_weekend_nights <= 0.50
|   |   |   |   |--- lead_time <= 180.50
|   |   |   |   |   |--- lead_time <= 159.50
|   |   |   |   |   |   |--- weights: [7.46, 7.59] class: 1
|   |   |   |   |   |--- lead_time >  159.50
|   |   |   |   |   |   |--- weights: [37.28, 4.55] class: 0
|   |   |   |   |--- lead_time >  180.50
|   |   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |   |--- weights: [20.13, 212.54] class: 1
|   |   |   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |   |   |--- weights: [8.95, 0.00] class: 0
|   |   |   |--- no_of_weekend_nights >  0.50
|   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |--- weights: [231.12, 110.82] class: 0
|   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |--- weights: [19.38, 34.92] class: 1
|   |   |   |   |--- market_segment_type_Offline >  0.50
|   |   |   |   |   |--- lead_time <= 348.50
|   |   |   |   |   |   |--- weights: [106.61, 3.04] class: 0
|   |   |   |   |   |--- lead_time >  348.50
|   |   |   |   |   |   |--- weights: [5.96, 4.55] class: 0
|   |--- avg_price_per_room >  100.04
|   |   |--- arrival_month <= 11.50
|   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |--- weights: [0.00, 3200.19] class: 1
|   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |--- weights: [23.11, 0.00] class: 0
|   |   |--- arrival_month >  11.50
|   |   |   |--- no_of_special_requests <= 0.50
|   |   |   |   |--- weights: [35.04, 0.00] class: 0
|   |   |   |--- no_of_special_requests >  0.50
|   |   |   |   |--- arrival_date <= 24.50
|   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |--- arrival_date >  24.50
|   |   |   |   |   |--- weights: [3.73, 22.77] class: 1

  • The diagram shows matching observations on the pre-pruned and post-pruned trees.
In [142]:
# showing relative importances of features
importances = best_model.feature_importances_
indices = np.argsort(importances)

# graph parameters
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
  • Lead time, Market segment type online, average price per room, number of special requests and arrival month are top five most important features in determining the output of the classification.

Comparing Decision Tree models¶

In [143]:
# training performance comparison

models_train_comp_df = pd.concat(
    [
        decision_tree_perf_train_without.T,
        decision_tree_tune_perf_train.T,
        decision_tree_post_perf_train.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree sklearn",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[143]:
Decision Tree sklearn Decision Tree (Pre-Pruning) Decision Tree (Post-Pruning)
Accuracy 0.99421 0.83097 0.89954
Recall 0.98661 0.78608 0.90303
Precision 0.99578 0.72425 0.81274
F1 0.99117 0.75390 0.85551
In [144]:
# testing performance comparison

models_test_comp_df = pd.concat(
    [
        decision_tree_perf_test_without.T,
        decision_tree_perf_test.T,
        decision_tree_tune_perf_test.T,
        decision_tree_post_test.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Decision Tree without class_weight",
    "Decision Tree with class_weight",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Test set performance comparison:")
models_test_comp_df
Test set performance comparison:
Out[144]:
Decision Tree without class_weight Decision Tree with class_weight Decision Tree (Pre-Pruning) Decision Tree (Post-Pruning)
Accuracy 0.87118 0.86621 0.83497 0.86879
Recall 0.81175 0.80494 0.78336 0.85576
Precision 0.79461 0.78663 0.72758 0.76614
F1 0.80309 0.79568 0.75444 0.80848
  • Pre-pruning and post-pruning decision tree models are generally giving high recall values on both the training and testing data sets. However, we will select the post-pruned model as it gave us better values on the performance metrics.

Conclusions and Business Recommendations¶

  • While the lead time is the most significant feature in determing a classification, the online market segment, average price per room and number of special requests also significantly impact the possibility of a client canceling a booking, as observed per our model.
  • Clients who made bookings in longer lead times also made the most cancelations. The hotel could make policies that refuse refunds for cancelations made for bookings after a certain period.
  • When the model identifies a possible cancellation, the hotel can use this information to periodically contact the clients to confirm their booking. This would help them identify to cancelations quicker.
  • October was the month with the most arrivals. However, the first quarter is quite dormant with generally very few arrivals. The price average room price during that period is also generally more expensive. Perhaps, this could explain the dormancy. The hotel can consider adjusting its prices for that period to entice more client bookings. Also, activities could be organised in or around the premises during that period to draw in more clients.
  • The children attendance was quite small. Kids facilities could be installed to encourage families to check in with kids.
  • The adult bookings was mainly for two people so this could suggest that clients were mostly couples. The hotel could take advantage of this by organising events for couples. This could even influence the length of their stay.
  • Required parking space is not a huge priority for clients. If the hotel can afford it, they could use buses or private cars to convey clients to the hotel when they arrive in the city and from the hotel when they check out.
  • There were few stay-ins during weekend nights, perhaps because clients had to prepare for weekday activities. However, further investigation could be done to understand this observation better.
  • Meal plan 1 is the most popular meal plan. The client demographic could be a factor of this observation. The hotel could introduce customer review systems for clients rate services and consequently help the hotel to understand their likes and dislikes. This suggested review system should include a customer recommendation feature if considered for implementation.
  • Room type 1 is the most preferred room type. A review system could also help to understand this observation better.
  • The online market segment are the group which made the most bookings, indicating a good online presence. This should be improved to draw even more clients. Online advertisements is one way to achieve this. The hotel could also develop partnerships with corporate bodies to draw attention from that segment. This could be in the form of sponsorships, discount rates for the use of the hotel's facilities,etc.
  • Not many special requests were made along with bookings. The cost increased with increasing number of special requests so this perhaps deterred client from making them.
  • 32.8% of clients cancel bookings. Though relatively small, this figure translates into more than 10000 cancellations. The hotel took a good initiative in requesting this research to uncover reasons for this.
  • 50% of bookings were made in advance of less than 100 days, which was the average period for this feature.
  • 50% of bookings made cost less than 100 euros.
  • The canceled bookings were generally more expensive than the uncanceled ones. Perhaps, clients found cheaper options which influenced this choice. The hotel could consider exploring competitor rates to help gain more insight into this.
  • Generally clients do not cancel bookings when they make some special requests.
  • The hotel generally has a good returning customer base which suggests that customers who use the place are satisfied enough with the services to make another booking. This is a feather in the hotel's cap.